Overcoming statistical machine translation limitations: error analysis and proposed solutions for the Catalan-Spanish language pair

Authors

  • Mireia Farrús
  • Marta R. Costa-Jussà
  • José B. Mariño
  • Marc Poch
  • Adolfo Hernández
  • Carlos A. Henríquez Q.
  • José A. R. Fonollosa
Abstract

This work aims to improve an N-gram-based statistical machine translation system between the Catalan and Spanish languages, trained with an aligned Spanish–Catalan parallel corpus consisting of 1.7 million sentences taken from El Periódico...

Affiliations:

  • M. Farrús (corresponding author), M. R. Costa-jussà, J. B. Mariño, M. Poch, A. Hernández, C. Henríquez, J. A. R. Fonollosa: TALP Research Center, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya, C/Jordi Girona 1-3, 08034 Barcelona, Spain
  • M. Farrús (present address): Office of Learning Technologies, Universitat Oberta de Catalunya, Av. Tibidabo, 47, 08035 Barcelona, Spain
  • M. R. Costa-jussà: Voice and Language Department, Barcelona Media Innovation Center, Av Diagonal 177, 9th Floor, 08018 Barcelona, Spain
  • M. Poch (present address): Universitat Pompeu Fabra, Roc Boronat, 138, 08018 Barcelona, Spain

Lang Resources & Evaluation (2011) 45:181–208, DOI 10.1007/s10579-011-9137-0

Related resources

Catalan-English Statistical Machine Translation without Parallel Corpus: Bridging through Spanish

This paper presents a full experiment on large-vocabulary Catalan-English statistical machine translation without an English-Catalan parallel corpus, in the context of the debates of the European Parliament. For this, we make use of an English-Spanish European Parliament Proceedings parallel corpus and a Spanish-Catalan general newspaper parallel corpus, both of more than 30 M words. G...

Catalan-English statistical machine translation without a parallel corpus

This paper presents a full experiment on large-vocabulary Catalan-English statistical machine translation without an English-Catalan parallel corpus, in the context of the debates of the European Parliament. For this, we make use of an English-Spanish European Parliament Proceedings parallel corpus and a Spanish-Catalan general newspaper parallel corpus, both of more than 30 M words. G...

English-Catalan Neural Machine Translation in the Biomedical Domain through the cascade approach

This paper describes the methodology followed to build a neural machine translation system in the biomedical domain for the English-Catalan language pair. This task can be considered a low-resourced task from the point of view of the domain and the language pair. To face this task, this paper reports experiments on a cascade pivot strategy through Spanish for the neural machine translation usin...

Towards the Use of Word Stems and Suffixes for Statistical Machine Translation

In this paper we present methods for improving the quality of translation from an inflected language into English by making use of part-of-speech tags and word stems and suffixes in the source language. Results for translations from Spanish and Catalan into English are presented on the LC-STAR trilingual corpus which consists of spontaneously spoken dialogues in the domain of travelling and app...

A Large Spanish-Catalan Parallel Corpus Release for Machine Translation

We present a large Spanish-Catalan parallel corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper. The produced corpus of 7.5 M parallel sentences (around 180 M words per language) is useful for many natural language applications. We report excellent results when building a statistical machine translation system trained on this parallel corpus. The Spanish-Catala...

Journal title:
  • Language Resources and Evaluation

Volume 45

Pages 181–208

Publication date 2011